This week in class we learned about interactive data visualisations using different R libraries including ‘ggiraph’, ‘plotly’, and ‘gganimate’.
While static graphs can offer a lot of insights, interactive graphs offer the user more options to explore graphs. In this week’s Take Home Exercise, we explore Singapore’s Age structure by gender and planning area.
First, we should refer to which libraries are needed for this exercise. The necessary packagaes and their uses are listed below:
We load all of the packages using Professor Kam’s code:
packages = c('ggiraph', 'plotly', 'gganimate',
'patchwork', 'DT', 'tidyverse', 'gifski',
'plyr')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
When visualising data while taking many variables into perspective, it is always useful to build a simple base graph on which we can expand. For this graph, we will only consider one year and will not look into Planning Areas yet.
Let us start with data exploration. The code for uploading the data is seen below.
The two files used in this visualisation are:
The data is taken from official Singapore government statistics. The files can be found here.
Now, let’s see what the data looks like:
[1] "__ Summary __"
'data.frame': 1040592 obs. of 7 variables:
$ PA : chr "Ang Mo Kio" "Ang Mo Kio" "Ang Mo Kio" "Ang Mo Kio" ...
$ SZ : chr "Cheng San" "Cheng San" "Cheng San" "Cheng San" ...
$ AG : chr "0_to_4" "0_to_4" "0_to_4" "0_to_4" ...
$ Sex : chr "Males" "Males" "Males" "Males" ...
$ TOD : chr "HDB 1- and 2-Room Flats" "HDB 3-Room Flats" "HDB 4-Room Flats" "HDB 5-Room and Executive Flats" ...
$ Pop : int 20 480 220 80 0 0 0 0 20 390 ...
$ Time: int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
NULL
[1] "__ Head __"
PA SZ AG Sex
1 Ang Mo Kio Cheng San 0_to_4 Males
2 Ang Mo Kio Cheng San 0_to_4 Males
3 Ang Mo Kio Cheng San 0_to_4 Males
4 Ang Mo Kio Cheng San 0_to_4 Males
5 Ang Mo Kio Cheng San 0_to_4 Males
6 Ang Mo Kio Cheng San 0_to_4 Males
TOD Pop Time
1 HDB 1- and 2-Room Flats 20 2000
2 HDB 3-Room Flats 480 2000
3 HDB 4-Room Flats 220 2000
4 HDB 5-Room and Executive Flats 80 2000
5 HUDC Flats (excluding those privatised) 0 2000
6 Landed Properties 0 2000
[1] "__ Summary __"
'data.frame': 984656 obs. of 7 variables:
$ PA : chr "Ang Mo Kio" "Ang Mo Kio" "Ang Mo Kio" "Ang Mo Kio" ...
$ SZ : chr "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" ...
$ AG : chr "0_to_4" "0_to_4" "0_to_4" "0_to_4" ...
$ Sex : chr "Males" "Males" "Males" "Males" ...
$ TOD : chr "HDB 1- and 2-Room Flats" "HDB 3-Room Flats" "HDB 4-Room Flats" "HDB 5-Room and Executive Flats" ...
$ Pop : int 0 10 30 50 0 0 40 0 0 10 ...
$ Time: int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
NULL
[1] "__ Head __"
PA SZ AG Sex
1 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
2 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
3 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
4 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
5 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
6 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
TOD Pop Time
1 HDB 1- and 2-Room Flats 0 2011
2 HDB 3-Room Flats 10 2011
3 HDB 4-Room Flats 30 2011
4 HDB 5-Room and Executive Flats 50 2011
5 HUDC Flats (excluding those privatised) 0 2011
6 Landed Properties 0 2011
As we can see here, the tables are identical except for the time period. Therefore, it is possible to combine them into one large data frame:
'data.frame': 2025248 obs. of 7 variables:
$ PA : chr "Ang Mo Kio" "Ang Mo Kio" "Ang Mo Kio" "Ang Mo Kio" ...
$ SZ : chr "Cheng San" "Cheng San" "Cheng San" "Cheng San" ...
$ AG : chr "0_to_4" "0_to_4" "0_to_4" "0_to_4" ...
$ Sex : chr "Males" "Males" "Males" "Males" ...
$ TOD : chr "HDB 1- and 2-Room Flats" "HDB 3-Room Flats" "HDB 4-Room Flats" "HDB 5-Room and Executive Flats" ...
$ Pop : int 20 480 220 80 0 0 0 0 20 390 ...
$ Time: int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
Looking at the variables, it becomes clear that we can drop all of them except Age Group (AG), Gender (Sex), Population (Pop), and Year (Time). Moreover, given that we are trying to build an Age-Sex Pyramid, we need to group the data. The function is referenced from here.
We get following tibble:
# A tibble: 798 x 4
# Groups: AG, Sex [38]
AG Sex Time Pop
<chr> <chr> <int> <int>
1 0_to_4 Females 2000 108850
2 0_to_4 Females 2001 107510
3 0_to_4 Females 2002 105310
4 0_to_4 Females 2003 101430
5 0_to_4 Females 2004 99290
6 0_to_4 Females 2005 96890
7 0_to_4 Females 2006 94910
8 0_to_4 Females 2007 94610
9 0_to_4 Females 2008 94780
10 0_to_4 Females 2009 97260
11 0_to_4 Females 2010 95630
12 0_to_4 Females 2011 92590
13 0_to_4 Females 2012 91880
14 0_to_4 Females 2013 89810
15 0_to_4 Females 2014 88850
16 0_to_4 Females 2015 89800
17 0_to_4 Females 2016 91400
18 0_to_4 Females 2017 91760
19 0_to_4 Females 2018 90720
20 0_to_4 Females 2019 90850
21 0_to_4 Females 2020 89720
22 0_to_4 Males 2000 117000
23 0_to_4 Males 2001 115420
24 0_to_4 Males 2002 112880
25 0_to_4 Males 2003 108660
26 0_to_4 Males 2004 105740
27 0_to_4 Males 2005 102740
28 0_to_4 Males 2006 99990
29 0_to_4 Males 2007 99160
30 0_to_4 Males 2008 99340
31 0_to_4 Males 2009 100790
32 0_to_4 Males 2010 99030
33 0_to_4 Males 2011 96030
34 0_to_4 Males 2012 95050
35 0_to_4 Males 2013 93750
36 0_to_4 Males 2014 92700
37 0_to_4 Males 2015 94050
38 0_to_4 Males 2016 95760
39 0_to_4 Males 2017 96210
40 0_to_4 Males 2018 94940
# ... with 758 more rows
For the first graph, we will focus only on the year 2000. We will use ggplot to build it.
plot <- data_grouped %>% ggplot(aes(x=AG, y = Pop, fill = Sex)) +
geom_col(data = data_grouped %>% filter(Sex=="Females", Time == 2000),
fill = "pink") +
geom_col(data = data_grouped %>% filter(Sex=="Males", Time == 2000),
aes(y=Pop*(-1)), fill = "navy") +
scale_y_continuous(name = "Population (in 000's)",
breaks = y_breaks,
labels = y_labels) +
labs(title = "Age Sex Pyramid Singapore", x = "Age Group") +
coord_flip() +
theme_light() +
theme(plot.title = element_text(hjust=0.5))
The code first references the data. We set population on the y-axis and Age on the x-axis. The difference between male and female is going to be shown by Gender (Sex).
Then we draw the bar graph for women and men separately using geom_col. As can be seen, for both of them the data is filtered to the year 2000 and their respective gender.
For the male plot, an aesthetic is added that makes its values negative. This is necessary to let the bars rise in the opposite direction of the female plot. To find out more about this click here.
Next, the axis names are set with scale_y_continuous and labs.
Breaks and labels for the y-axis are also set. How these are determined is shown below.
Lastly, the coordinates are flipped (as we want horizontal bars) with coord_flip() and the layout is set using theme_light and theme. Reference for the themes can be found here.
y_breaks <- seq(-(round_any(max(
filter(data_grouped, Sex=="Females", Time == 2000)$Pop), 100000, f = ceiling)),
round_any(max(
filter(data_grouped, Sex=="Females", Time == 2000)$Pop), 100000, f = ceiling),
50000)
#in this the max value is taken for population and the scale is spun across it
y_labels <- abs(seq(-(round_any(max(
filter(data_grouped, Sex=="Females", Time == 2000)$Pop)/1000, 100, f = ceiling)),
round_any(max(
filter(data_grouped, Sex=="Females", Time == 2000)$Pop)/1000, 100, f = ceiling),
50))
#given that we are dealing with large numbers, the labels will be shown in thousands.
The rounding and seq functions for the breaks and labels were referenced.
The resulting graph is:
On error in the graph immediately catches the eye. The age group 5_to_9 is in the middle of the graph instead of the bottom. This is because R reads the names as text instead of numbers. An easy remedy for this is to add to a 0 in front of the 5. However, to keep format, we will aslo be adding 0s in front of 0, 4, and 9.
data[data$AG=="5_to_9", ]$AG <- "05_to_09"
data[data$AG=="0_to_4", ]$AG <- "00_to_04"
More than this, we will use the plotly library to enable basic interactivity in our graph. In order to do this we will use the ggplotly function. Therefore, our final simple graph looks like this:
An alternative way to make the graph interactive using ggiraph would be:
y_labels <- abs(seq(-(round_any(max(
filter(data_grouped, Sex=="Females", Time == 2000)$Pop)/1000, 100, f = ceiling)),
round_any(max(
filter(data_grouped, Sex=="Females", Time == 2000)$Pop)/1000, 100, f = ceiling),
50))
y_breaks <- seq(-(round_any(max(
filter(data_grouped, Sex=="Females", Time == 2000)$Pop), 100000, f = ceiling)),
round_any(max(
filter(data_grouped, Sex=="Females", Time == 2000)$Pop), 100000, f = ceiling),
50000)
data_grouped$tooltip <- c(paste0("Age: ", data_grouped$AG,
"\n Population: ", format(data_grouped$Pop, big.mark=","),
"\n Gender: ", data_grouped$Sex))
plot <- data_grouped %>% ggplot(aes(x=AG, y = Pop, fill = Sex)) +
geom_col_interactive(data = data_grouped %>% filter(Sex=="Females", Time == 2000),
fill = "pink", aes(tooltip=tooltip)) +
geom_col_interactive(data = data_grouped %>% filter(Sex=="Males", Time == 2000),
aes(y=Pop*(-1), tooltip=tooltip), fill = "navy") +
scale_y_continuous(name = "Population (in 000's)",
breaks = y_breaks,
labels = y_labels) +
labs(title = "Age Sex Pyramid Singapore", x = "Age Group") +
coord_flip() +
theme_light() +
theme(plot.title = element_text(hjust=0.5))
girafe(ggobj = plot,
width_svg = 6,
height_svg = 6*0.70)
Next, we want to expand our graph to be able to show the data for each year. We can do this either using plotly or gganimate.
For the general data prep, we change the breaks and labels for the y-axis. We remove the Time == 2000 filter to make sure our scale is large enough to support the highest overall value any year has.
y_labels <- abs(seq(-(round_any(max(
filter(data_grouped, Sex=="Females")$Pop)/1000, 100, f = ceiling)),
round_any(max(
filter(data_grouped, Sex=="Females")$Pop)/1000, 100, f = ceiling),
50))
y_breaks <- seq(-(round_any(max(
filter(data_grouped, Sex=="Females")$Pop), 100000, f = ceiling)),
round_any(max(
filter(data_grouped, Sex=="Females")$Pop), 100000, f = ceiling),
50000)
Next, we redo the plot using gganimate. In order to do this we add the transition_time function to set Years as the framing variable and the ease_aes function to determine the aesthetics of the animation.
We add a subtitle to labs that will show which year we are in at the moment.
Lastly, we use animate to set the duration of the animation and make sure it is not too fast or too slow.
plot <- data_grouped %>% ggplot(aes(x=AG, y = Pop, fill = Sex)) +
geom_col(data = data_grouped %>% filter(Sex=="Females"),
fill = "pink") +
geom_col(data = data_grouped %>% filter(Sex=="Males"),
aes(y=Pop*(-1)), fill = "navy") +
scale_y_continuous(name = "Population (in 000's)",
breaks = y_breaks,
labels = y_labels) +
labs(title = "Age Sex Pyramid Singapore",
subtitle = "Year: {frame_time}",
x = "Age Group") +
coord_flip() +
theme_light() +
theme(plot.title = element_text(hjust=0.5)) +
transition_time(Time) +
ease_aes('linear', interval = 0.1)
animate(plot,
duration = 20)
However, the animation looks a little shaky. Let’s see if this is better when we use plotly.
Given that we are trying to achieve the same result but with a different library, we do NOT need to change the breaks and labels for the y-axis in this case.
For plotly, we use the same graph we used for gganimate with the exception of the gganimate functions.
Instead, we add a frame and id to the original ggplot graph (Reference 1 and 2 for this step).
More than that, when we initialise the graph with ggplotly we add animation options to make our graph clearer.
plot <- data_grouped %>% ggplot(aes(x=AG, y = Pop, fill = Sex, frame = Time, ids = AG)) +
geom_col(position = "identity", data = data_grouped %>% filter(Sex=="Females"),
fill = "pink") +
geom_col(position = "identity", data = data_grouped %>% filter(Sex=="Males"),
aes(y=Pop*(-1)), fill = "navy") +
scale_y_continuous(name = "Population (in 000's)",
breaks = y_breaks,
labels = y_labels) +
labs(title = "Age Sex Pyramid Singapore",
x = "Age Group") +
coord_flip() +
theme_light() +
theme(plot.title = element_text(hjust=0.5))
ggplotly(plot) %>% animation_opts(
1000, easing = "elastic", redraw = TRUE
)
Now that we have our year options in place, we just need to add the planning area variable to the mix and we will have our final graphs.
Singapore has a lot of Planning Areas. When visualising data, it is vital to distinguish between data that needs to be shown and data that does not. Therefore, visualising all 57 Planning Areas would be too much. Instead, there should be a metric to rank the Planning Areas by importance.
[1] "Ang Mo Kio" "Bedok"
[3] "Bishan" "Boon Lay/Pioneer"
[5] "Bukit Batok" "Bukit Merah"
[7] "Bukit Panjang" "Bukit Timah"
[9] "Central Water Catchment" "Changi"
[11] "Changi Bay" "Choa Chu Kang"
[13] "Clementi" "Downtown Core"
[15] "Geylang" "Hougang"
[17] "Jurong East" "Jurong West"
[19] "Kallang" "Lim Chu Kang"
[21] "Mandai" "Marina East"
[23] "Marina South" "Marine Parade"
[25] "Museum" "Newton"
[27] "North-Eastern Islands" "Novena"
[29] "Orchard" "Outram"
[31] "Pasir Ris" "Paya Lebar"
[33] "Punggol" "Queenstown"
[35] "River Valley" "Rochor"
[37] "Seletar" "Sembawang"
[39] "Sengkang" "Serangoon"
[41] "Simpang" "Singapore River"
[43] "Southern Islands" "Straits View"
[45] "Sungei Kadut" "Tampines"
[47] "Tanglin" "Tengah"
[49] "Toa Payoh" "Tuas"
[51] "Western Islands" "Western Water Catchment"
[53] "Woodlands" "Yishun"
[55] "Not Stated" "Boon Lay"
[57] "Pioneer"
As we can see above, each Planning Area has its own name. One metric, by which we could rank them, is population
# A tibble: 5 x 2
PA Pop
<chr> <int>
1 Bedok 6019760
2 Tampines 5392140
3 Jurong West 5297470
4 Woodlands 4936430
5 Hougang 4526000
Here we can see the 5 Planning Areas with the largest population and their actual population numbers. Let us use the data preparation steps from above to create an identical data frame which only has the data for these 5 Planning Areas.
#filtering the data for the PAs
top5_PA <- data %>% filter(PA=="Bedok"|PA=="Tampines"|
PA=="Jurong West" | PA=="Woodlands"|
PA=="Hougang")
#extracting only the necessary variables and grouping the resultant data frame
top5_PA_graph <- top5_PA[,c("AG","Sex","Pop", "Time", "PA")]
top5_PA_grouped <- top5_PA_graph %>%
group_by(AG, Sex, Time, PA) %>%
dplyr::summarise(Pop = sum(Pop))
#assigning labels and breaks to the population axis
y_labels <- abs(seq(-(round_any(max(
filter(top5_PA_grouped, Sex=="Females")$Pop)/100, 100, f = ceiling)),
round_any(max(
filter(top5_PA_grouped, Sex=="Females")$Pop)/100, 100, f = ceiling),
50))
y_breaks <- seq(-(round_any(max(
filter(top5_PA_grouped, Sex=="Females")$Pop), 10000, f = ceiling)),
round_any(max(
filter(top5_PA_grouped, Sex=="Females")$Pop), 10000, f = ceiling),
5000)
#plotting the Age Sex Pyramid for each PA separately
top5_PA_plot <- top5_PA_grouped %>% ggplot(aes(x=AG,
y = Pop,
fill = Sex,
frame = Time,
ids = AG)) +
geom_col(position = "identity",
data = top5_PA_grouped %>% filter(Sex=="Females"),
fill = "pink") +
geom_col(position = "identity",
data = top5_PA_grouped %>% filter(Sex=="Males"),
aes(y=Pop*(-1)), fill = "navy") +
scale_y_continuous(name = "Population (in 00's)",
breaks = y_breaks,
labels = y_labels) +
labs(title = "Age Sex Pyramid Singapore",
x = "Age Group") +
coord_flip() +
theme_light() +
theme(plot.title = element_text(hjust=0.5)) +
facet_wrap(~PA)
ggplotly(top5_PA_plot) %>% animation_opts(
1000, easing = "elastic", redraw = TRUE
)
As we can see above, each Planning Area has its own Age Sex Pyramid shape.
However, the graph we just created is not very clear and pretty hard to read. Wouldn’t it be nice to see each one of these graphs in full size, but without using up the space we would need for 5 full sized plots?
First, let us reinitialise the data. However, when choosing variables we will include Planning Areas this time.
data_graph <- data[,c("AG","Sex","Pop", "Time", "PA")]
data_grouped <- data_graph %>%
group_by(AG, Sex, Time, PA) %>%
dplyr::summarise(Pop = sum(Pop))
y_labels <- abs(seq(-(round_any(max(
filter(data_grouped, Sex=="Females")$Pop)/100, 100, f = ceiling)),
round_any(max(
filter(data_grouped, Sex=="Females")$Pop)/100, 100, f = ceiling),
50))
y_breaks <- seq(-(round_any(max(
filter(data_grouped, Sex=="Females")$Pop), 10000, f = ceiling)),
round_any(max(
filter(data_grouped, Sex=="Females")$Pop), 10000, f = ceiling),
5000)
Next, we will create plots for all of the different planning areas in one graph. This means adding 10 plots (5 Planning Ares x 2 Genders) to the ggplot.
plot <- data_grouped %>%
ggplot(aes(x=AG, y = Pop, fill = Sex, frame = Time)) +
geom_col(position = "identity",
data = data_grouped %>% filter(Sex=="Females", PA == "Bedok"),
fill = "pink") +
geom_col(position = "identity",
data = data_grouped %>% filter(Sex=="Males", PA == "Bedok"),
aes(y=Pop*(-1)), fill = "navy") +
geom_col(position = "identity",
data = data_grouped %>% filter(Sex=="Females", PA == "Tampines"),
fill = "pink") +
geom_col(position = "identity",
data = data_grouped %>% filter(Sex=="Males", PA == "Tampines"),
aes(y=Pop*(-1)), fill = "navy") +
geom_col(position = "identity",
data = data_grouped %>% filter(Sex=="Females", PA == "Jurong West"),
fill = "pink") +
geom_col(position = "identity",
data = data_grouped %>% filter(Sex=="Males", PA == "Jurong West"),
aes(y=Pop*(-1)), fill = "navy") +
geom_col(position = "identity",
data = data_grouped %>% filter(Sex=="Females", PA == "Woodlands"),
fill = "pink") +
geom_col(position = "identity",
data = data_grouped %>% filter(Sex=="Males", PA == "Woodlands"),
aes(y=Pop*(-1)), fill = "navy") +
geom_col(position = "identity",
data = data_grouped %>% filter(Sex=="Females", PA == "Hougang"),
fill = "pink") +
geom_col(position = "identity",
data = data_grouped %>% filter(Sex=="Males", PA == "Hougang"),
aes(y=Pop*(-1)), fill = "navy") +
scale_y_continuous(name = "Population (in 00's)",
breaks = y_breaks,
labels = y_labels) +
labs(title = "Age Sex Pyramid Singapore", x = "Age Group") +
coord_flip() +
theme_light() +
theme(plot.title = element_text(hjust=0.5))
plot
As we can see, this plot just gives us a pyramid with values that we cannot place and no option of seeing for which Planning Area the pyramid is.
To alleviate this problem we can introduce a drop down menu to choose the Planning Area. We will use the updatemenus argument from the layout function in the plotly library.
We will create so-called buttons. In each of the buttons we will choose which of the 8 plots above will be visible when clicking it. For example, for the Bedok button we will only activate plots 1 and 2 from our graph. These are the female and male graphs filtered for Bedok.
updatemenus <- list(
list(
active = 0,
buttons = list(
list(
label = "Bedok",
method = "update",
args = list(list(visible = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE)),
list(title = "Age Sex Pyramid Bedok"))),
list(
label = "Tampines",
method = "update",
args = list(list(visible = c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE)),
list(title = "Age Sex Pyramid Tampines"))),
list(
label = "Jurong West",
method = "update",
args = list(list(visible = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE,
FALSE, FALSE, FALSE, FALSE)),
list(title = "Age Sex Pyramid Jurong West"))),
list(
label = "Woodlands",
method = "update",
args = list(list(visible = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
TRUE, TRUE, FALSE, FALSE)),
list(title = "Age Sex Pyramid Woodlands"))),
list(
label = "Hougang",
method = "update",
args = list(list(visible = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, TRUE)),
list(title = "Age Sex Pyramid Hougang")))
)
))
Now, we are one step away from our final graph. After graphing out our drop down menu (reference), we will also include the plotly year animation that has been discussed earlier.
Moreover, we will add a custom tooltip to the graph.
ggplotly(plot) %>%
animation_opts(1000,
easing = "elastic",
redraw = TRUE) %>%
layout(showlegend=FALSE,
updatemenus=updatemenus)
Thank you for taking the time to read through my second Take Home Exercise. I hope you enjoyed my visualisations.
Distill is a publication format for scientific and technical writing, native to the web.
Learn more about using Distill for R Markdown at https://rstudio.github.io/distill.